Using mtcars, a data set built into base R, we will display
various plots.
plot(mtcars$mpg)
The above plot uses the mpg variable of the
mtcars data set. When a single variable is provided, its values are
plotted on the Y-axis against their index (row position) on the X-axis.
The plot is not especially informative, but it showcases how easy it is
to create a plot in R.
A scatter plot of two variables can give insight into the
relationship between them. Below is a scatter plot of
wt versus mpg (weight vs. miles per
gallon).
plot(mtcars$wt, mtcars$mpg)
Looking at the scatter plot above, it looks as if the miles per gallon of a car decreases as the weight of the car increases.
When two variables are provided, by default the first becomes the
X-axis and the second becomes the Y-axis. We can also name the
arguments explicitly, in which case their order no longer matters.
For example, plot(y = mtcars$mpg, x = mtcars$wt) draws the same plot:
mpg is listed first but still goes on the Y-axis, and wt goes on the
X-axis.
By default, plot() will create a scatter plot, but if we
provide a different value for the type parameter,
we get a different plot. For example, we can instead draw a line
chart by setting type = "l".
plot(mtcars$mpg, type = "l")
plot(mtcars$wt, mtcars$mpg, type = "l")
As we can see above, a given chart type does not always make sense for the data. Here the line zigzags because the points are connected in the order the observations appear in the data set, not in order of weight. It would be more insightful to fit a linear regression and use it to create predictions. Linear regression will be discussed in another reference document.
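To illustrate the point about observation order, here is a minimal sketch that sorts the observations by weight before drawing the line. The helper name ord and the axis labels are illustrative choices, not part of the original example.

```r
# Sort the row positions by weight so the line runs left to right.
ord <- order(mtcars$wt)

# Connect the points in order of increasing weight.
plot(mtcars$wt[ord], mtcars$mpg[ord], type = "l",
     xlab = "wt", ylab = "mpg")
```

The line is still jagged because cars of similar weight differ in mpg, but it no longer doubles back across the plot.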
The bar plot will use the cyl variable from the
mtcars data set. cyl refers to the number of
cylinders in each car.
barplot(table(mtcars$cyl))
The table() function was used to count the cars for each number of
cylinders. Let’s try using it on its own.
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
There are 11 cars with 4 cylinders, 7 cars with 6 cylinders, and 14 cars with 8 cylinders.
A histogram can give us some insight into the distribution of the data.
hist(mtcars$mpg)
We can see that the largest group of cars in the mtcars
data set gets between 15 and 20 miles per gallon of gas, while very
few get between 25 and 30 miles per gallon.
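The bin counts behind a histogram can be checked without reading them off the bars. The following is a small sketch that uses the plot = FALSE argument of hist(); the variable name h is an illustrative choice.

```r
# Compute the histogram without drawing it, then inspect the bins.
h <- hist(mtcars$mpg, plot = FALSE)

h$breaks  # bin edges
h$counts  # number of cars falling in each bin
```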
We can use a box plot to display the range, first quartile, median, and
third quartile. Below is a box plot for the mpg variable
of the mtcars data set. The horizontal
argument is set to TRUE to display the box plot
horizontally rather than vertically, which is the default.
boxplot(mtcars$mpg, horizontal = TRUE)
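The numbers a box plot draws can also be printed. This sketch uses boxplot.stats() from base R; the labels assigned with names() are added here for readability. Note that the hinges it reports can differ slightly from the quartiles returned by quantile().

```r
# The five statistics drawn by boxplot(): whisker ends, hinges, and median.
stats <- boxplot.stats(mtcars$mpg)$stats
names(stats) <- c("lower whisker", "lower hinge", "median",
                  "upper hinge", "upper whisker")
stats

# Quartiles computed by quantile(), for comparison with the hinges.
quantile(mtcars$mpg)
```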
The following uses the scatter plot of wt versus
mpg.
The plot can be customized to suit your preferences, for example by adding a title and axis labels or by changing the color of the data markers, title, or labels.
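As a sketch of such customization, the call below sets a title, axis labels, and colors through standard arguments of plot(). The title text, label text, and color choices are assumptions made for illustration.

```r
plot(mtcars$wt, mtcars$mpg,
     main = "Miles per Gallon vs. Weight",  # plot title
     xlab = "Weight (1000 lbs)",            # X-axis label
     ylab = "Miles per Gallon",             # Y-axis label
     col = "blue", pch = 19,                # marker color and a filled-circle marker
     col.main = "darkred",                  # title color
     col.lab = "darkgreen")                 # axis label color
```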
Base R provides us with enough to make plots with simple
customization. Next, we will showcase ggplot2, a package
from the tidyverse collection that will allow us to have
further control over plot customization.
ggplot2 for Visualization

The ggplot2 package allows us to create beautiful
visualizations by building a base plot that we can add on to.
For example, we will first create a simple scatter plot of
wt versus mpg using ggplot2. Then
we will add on to it. Before creating the ggplot, we first need to load
the required library.
library(ggplot2)
ggplot2 Scatter Plot

After loading the ggplot2 package, we can start creating
the base of a scatter plot.
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point()
We used the ggplot() function to create the plot and
passed mtcars as the data argument. The variables mapped to
the x- and y-axes are given in the aes()
portion of the code. With that, ggplot knows what data it is working
with and which variables will be used. The + symbol adds
layers on top of that base. In the example above, geom_point() draws
each observation as a point, which produces a scatter plot.
Comparing the simple ggplot2 scatter plot to the one
made with base R, we can already see that it looks a lot better. Let’s
continue making it look even better.
ggplot2 Scatter Plot With Regression Line

Note that we can keep adding features to the plot by chaining more
+ operators, each followed by the desired addition. For example, let’s
add a linear regression line to the scatter plot by using
geom_smooth(method = "lm").
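A minimal sketch of that addition is shown here. The color and fill values are assumptions chosen for illustration; only method = "lm" comes from the text.

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Fit and draw a linear regression line with its confidence band.
  geom_smooth(method = "lm", color = "darkgreen", fill = "darkgreen")
p
```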
## `geom_smooth()` using formula = 'y ~ x'
The line that runs through the scatter plot is the linear regression
line that gives us an understanding of the relationship between the two
variables. This line can also be used to make predictions on new data.
The area that is dark green displays the confidence interval for the
regression line. It is displayed because the se parameter
defaults to TRUE. You can turn it off using
se = FALSE.
ggplot2 Bar Plot

For the x argument, since the cyl variable
contains only 3 possible values, it should be treated as
categorical, so we need to convert it to a factor. We
will further customize the plot by giving each bar a different
color.
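A bar plot along these lines could be sketched as follows. The label text is an assumption; factor(cyl) is used for both the x position and the fill so that each bar gets its own color.

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar() +  # counts the cars in each cylinder group
  labs(x = "Cylinders", y = "Count", fill = "Cylinders")
p
```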
ggplot2 Histogram

We can see that this histogram looks similar to the one using the
base R function hist(), but a lot nicer.
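A histogram like the one described could be produced with geom_histogram(). In this sketch the breaks are set to 5-mpg-wide bins from 10 to 35 so that they line up with the bins hist() picks for this variable. The colors and labels are assumptions.

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = mpg)) +
  # Explicit bin edges every 5 mpg, from 10 to 35.
  geom_histogram(breaks = seq(10, 35, by = 5),
                 fill = "steelblue", color = "white") +
  labs(x = "Miles per Gallon", y = "Count")
p
```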
ggplot2 Box Plot

We can also apply a fill color by categorical group. For example,
the following plot displays box plots for mpg,
separated by the number of cylinders.
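One way to sketch such a grouped box plot is to map factor(cyl) to both the x position and the fill. The label text is an assumption.

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot() +  # one box per cylinder group
  labs(x = "Cylinders", y = "Miles per Gallon", fill = "Cylinders")
p
```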
We could also have applied a facet grid.
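As a rough sketch of the facet alternative, facet_grid() can split the box plots into one panel per cylinder count. The empty string for x is a placeholder so that each panel holds a single box; the labels are assumptions.

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = "", y = mpg)) +
  geom_boxplot() +
  facet_grid(. ~ cyl) +  # one column of the grid per value of cyl
  labs(x = NULL, y = "Miles per Gallon")
p
```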
Using the map_data() function from the
ggplot2 package, we will create a map of the United States
in which each state is filled in with its own color.
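A minimal sketch of such a map follows. It assumes the maps package is installed, since map_data("state") reads its polygons from there. The white borders and the hidden legend are illustrative choices.

```r
library(ggplot2)

# Polygon outlines for the contiguous states (requires the maps package).
us_states <- map_data("state")

p <- ggplot(us_states, aes(x = long, y = lat, group = group, fill = region)) +
  geom_polygon(color = "white") +    # one filled polygon per state outline
  coord_quickmap() +                 # keeps a sensible aspect ratio
  theme(legend.position = "none")    # a legend with one entry per state adds little
p
```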
The map above gives little or no information on its own, but it serves as a starting template for geospatial data (specifically the United States in this example). The following map visualization uses a data set provided by CSU East Bay, STAT 541 - Intro. Data Visualization. First, let’s have a brief look at the data set.
library(dplyr)
library(tibble)
Note: The dplyr package is included in the
tidyverse collection and is a great package for data
manipulation thanks to its grammar-like structure.
## # A tibble: 1,269 × 17
## id name city state region highest_degree control gender admission_rate
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 102669 Alask… Anch… AK West Graduate Private CoEd 0.421
## 2 101648 Mario… Mari… AL South Associate Public CoEd 0.614
## 3 100830 Aubur… Mont… AL South Graduate Public CoEd 0.802
## 4 101879 Unive… Flor… AL South Graduate Public CoEd 0.679
## 5 100858 Aubur… Aubu… AL South Graduate Public CoEd 0.835
## 6 100663 Unive… Birm… AL South Graduate Public CoEd 0.857
## 7 101480 Jacks… Jack… AL South Graduate Public CoEd 0.833
## 8 102049 Samfo… Birm… AL South Graduate Private CoEd 0.595
## 9 101709 Unive… Mont… AL South Graduate Public CoEd 0.743
## 10 100751 The U… Tusc… AL South Graduate Public CoEd 0.510
## # ℹ 1,259 more rows
## # ℹ 8 more variables: sat_avg <int>, undergrads <int>, tuition <int>,
## # faculty_salary_avg <int>, loan_default_rate <chr>, median_debt <dbl>,
## # lon <dbl>, lat <dbl>
Dimension: \(1,269 \times 17\).
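Variables: id, name, city, state, region, highest_degree, control, gender, admission_rate, sat_avg, undergrads, tuition, faculty_salary_avg, loan_default_rate, median_debt, lon, lat.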
Now we correct the data types of the variables and aggregate the data to count the number of colleges within each state and region.
# Correcting the data type of the variables.
college <- college %>%
  mutate(state = as.factor(state),
         region = as.factor(region),
         highest_degree = as.factor(highest_degree),
         control = as.factor(control),
         gender = as.factor(gender),
         loan_default_rate = as.numeric(loan_default_rate))

# Counting the number of colleges in each state and region.
college_summary <- college %>%
  group_by(state, region) %>%
  summarise("School Count" = n())
college_summary
## # A tibble: 51 × 3
## # Groups: state [51]
## state region `School Count`
## <fct> <fct> <int>
## 1 AK West 1
## 2 AL South 24
## 3 AR South 16
## 4 AZ West 6
## 5 CA West 71
## 6 CO West 14
## 7 CT Northeast 14
## 8 DC South 6
## 9 DE South 3
## 10 FL South 36
## # ℹ 41 more rows
Using the data, we will plot a map of the United States with each state being color-coded based on the number of colleges it has.
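The sketch below shows one way such a map could be built. Because the college data set itself is not reproduced here, it uses a small stand-in data frame, school_counts, holding only some of the counts printed above. That data frame, its column names, and the label text are assumptions. States without a count are drawn in the default missing-value color. AK and DC are left out of the stand-in because map_data("state") has no Alaska polygon and the built-in state.abb vector has no entry for DC. The maps package must be installed.

```r
library(ggplot2)
library(dplyr)

# Stand-in for college_summary: a few of the counts shown in the output above.
school_counts <- data.frame(
  state = c("AL", "AR", "AZ", "CA", "CO", "CT", "DE", "FL"),
  school_count = c(24, 16, 6, 71, 14, 14, 3, 36)
)

# map_data("state") names states in lowercase (e.g. "alabama"),
# so translate the abbreviations with the built-in state.abb / state.name.
school_counts <- school_counts %>%
  mutate(region = tolower(state.name[match(state, state.abb)]))

# Attach the counts to the polygon data; left_join keeps the polygon row order.
us_states <- map_data("state")
map_df <- left_join(us_states, school_counts, by = "region")

p <- ggplot(map_df, aes(x = long, y = lat, group = group, fill = school_count)) +
  geom_polygon(color = "white") +
  coord_quickmap() +
  labs(fill = "School Count")
p
```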
A beautiful plot can give us a lot of information and can be used to identify relationships, trends, clusters, and more, but adding animation can bring the story to life.
We will use the gapminder data set from the
gapminder package to visualize some animated plots. Let’s
briefly look into the gapminder data set.
library(tibble)
library(gapminder)
as_tibble(gapminder)
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
Dimension: \(1,704 \times 6\).
Variables: country, continent, year, lifeExp, pop, gdpPercap.
The gganimate package is required for the animations.
gganimate Scatter Plot

The size of each point is associated with the population of the country. There is no meaning behind the colors of the data points. It can be seen that, over the years, as the log of GDP per capita increases, life expectancy tends to increase as well.
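An animated scatter plot matching this description could be sketched as follows. The transparency, the labels, and coloring the points by country are assumptions. The animate() call is left as a comment because rendering needs a renderer such as the gifski package and can take a while.

```r
library(ggplot2)
library(gganimate)
library(gapminder)

p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp,
                           size = pop, color = country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  scale_x_log10() +                    # GDP per capita on a log scale
  labs(title = "Year: {frame_time}",   # gganimate fills in the current year
       x = "GDP per capita (log scale)", y = "Life expectancy") +
  transition_time(year)                # one animation state per year

# animate(p)  # renders the animation
```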
gganimate Line Chart

Clearly, the populations of all the other continents pale in comparison to Asia’s.
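One way to sketch such a line chart is to total the population by continent and year and then reveal the lines over time with transition_reveal(). The name continent_pop and the labels are assumptions. As before, rendering is left as a commented animate() call.

```r
library(dplyr)
library(ggplot2)
library(gganimate)
library(gapminder)

# Total population per continent and year. pop is stored as an integer,
# so convert to numeric first to avoid integer overflow in the sums.
continent_pop <- gapminder %>%
  group_by(continent, year) %>%
  summarise(pop = sum(as.numeric(pop)), .groups = "drop")

p <- ggplot(continent_pop, aes(x = year, y = pop, color = continent)) +
  geom_line() +
  labs(x = "Year", y = "Population", color = "Continent") +
  transition_reveal(year)  # draws the lines progressively along the year axis

# animate(p)  # renders the animation
```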
gganimate Bar Chart With Shadow Mark

gganimate Box Plot

Data visualization helps us see data in a way that makes it easy to understand. It allows us to see the story that the data is telling. We can derive information from it and use that to make better decisions. This reference document will continue to be updated as I learn more about data visualization.